Date: 2018-05-03 11:14:25
This pipeline consists of a series of steps that produce analysable sequences for microbial community studies. Although originally built for 16S sequences, we will use it here with amplified ITS marker-gene sequences (Fungi) generated by Illumina MiSeq 2x300 bp paired-end sequencing.
Before starting, we must take a few precautions:
First, we are going to load the DADA2 package. You should have the latest version: 1.6.0. Then we will create a variable holding the path used to access the objects required for this pipeline.
library(dada2); packageVersion("dada2")
## Loading required package: Rcpp
## [1] '1.6.0'
path <- "data/ITS_sub/"
list.files(path)
## [1] "S10_R1.fastq" "S10_R2.fastq" "S11_R1.fastq" "S11_R2.fastq"
## [5] "S12_R1.fastq" "S12_R2.fastq" "S13_R1.fastq" "S13_R2.fastq"
## [9] "S14_R1.fastq" "S14_R2.fastq" "S15_R1.fastq" "S15_R2.fastq"
## [13] "S16_R1.fastq" "S16_R2.fastq" "S17_R1.fastq" "S17_R2.fastq"
## [17] "S18_R1.fastq" "S18_R2.fastq" "S1_R1.fastq" "S1_R2.fastq"
## [21] "S20_R1.fastq" "S20_R2.fastq" "S21_R1.fastq" "S21_R2.fastq"
## [25] "S22_R1.fastq" "S22_R2.fastq" "S23_R1.fastq" "S23_R2.fastq"
## [29] "S24_R1.fastq" "S24_R2.fastq" "S25_R1.fastq" "S25_R2.fastq"
## [33] "S26_R1.fastq" "S26_R2.fastq" "S27_R1.fastq" "S27_R2.fastq"
## [37] "S28_R1.fastq" "S28_R2.fastq" "S29_R1.fastq" "S29_R2.fastq"
## [41] "S2_R1.fastq" "S2_R2.fastq" "S30_R1.fastq" "S30_R2.fastq"
## [45] "S31_R1.fastq" "S31_R2.fastq" "S33_R1.fastq" "S33_R2.fastq"
## [49] "S34_R1.fastq" "S34_R2.fastq" "S35_R1.fastq" "S35_R2.fastq"
## [53] "S36_R1.fastq" "S36_R2.fastq" "S37_R1.fastq" "S37_R2.fastq"
## [57] "S39_R1.fastq" "S39_R2.fastq" "S3_R1.fastq" "S3_R2.fastq"
## [61] "S41_R1.fastq" "S41_R2.fastq" "S42_R1.fastq" "S42_R2.fastq"
## [65] "S43_R1.fastq" "S43_R2.fastq" "S44_R1.fastq" "S44_R2.fastq"
## [69] "S45_R1.fastq" "S45_R2.fastq" "S46_R1.fastq" "S46_R2.fastq"
## [73] "S47_R1.fastq" "S47_R2.fastq" "S48_R1.fastq" "S48_R2.fastq"
## [77] "S49_R1.fastq" "S49_R2.fastq" "S4_R1.fastq" "S4_R2.fastq"
## [81] "S50_R1.fastq" "S50_R2.fastq" "S51_R1.fastq" "S51_R2.fastq"
## [85] "S52_R1.fastq" "S52_R2.fastq" "S53_R1.fastq" "S53_R2.fastq"
## [89] "S54_R1.fastq" "S54_R2.fastq" "S55_R1.fastq" "S55_R2.fastq"
## [93] "S56_R1.fastq" "S56_R2.fastq" "S58_R1.fastq" "S58_R2.fastq"
## [97] "S59_R1.fastq" "S59_R2.fastq" "S5_R1.fastq" "S5_R2.fastq"
## [101] "S60_R1.fastq" "S60_R2.fastq" "S61_R1.fastq" "S61_R2.fastq"
## [105] "S62_R1.fastq" "S62_R2.fastq" "S63_R1.fastq" "S63_R2.fastq"
## [109] "S64_R1.fastq" "S64_R2.fastq" "S65_R1.fastq" "S65_R2.fastq"
## [113] "S66_R1.fastq" "S66_R2.fastq" "S67_R1.fastq" "S67_R2.fastq"
## [117] "S68_R1.fastq" "S68_R2.fastq" "S69_R1.fastq" "S69_R2.fastq"
## [121] "S6_R1.fastq" "S6_R2.fastq" "S72_R1.fastq" "S72_R2.fastq"
## [125] "S73_R1.fastq" "S73_R2.fastq" "S74_R1.fastq" "S74_R2.fastq"
## [129] "S75_R1.fastq" "S75_R2.fastq" "S76_R1.fastq" "S76_R2.fastq"
## [133] "S77_R1.fastq" "S77_R2.fastq" "S78_R1.fastq" "S78_R2.fastq"
## [137] "S79_R1.fastq" "S79_R2.fastq" "S7_R1.fastq" "S7_R2.fastq"
## [141] "S80_R1.fastq" "S80_R2.fastq" "S81_R1.fastq" "S81_R2.fastq"
## [145] "S8_R1.fastq" "S8_R2.fastq" "S9_R1.fastq" "S9_R2.fastq"
You should see the names of the fastq files.
Now we are going to read in the names of the fastq files and perform some string manipulation to get matched lists of the forward and reverse fastq files. The sort function ensures that forward and reverse reads end up in the same order.
fnFs <- sort(list.files(path, pattern="_R1.fastq"))
fnRs <- sort(list.files(path, pattern="_R2.fastq"))
Since each forward/reverse fastq pair belongs to the same sample, we extract the sample name and store it in a variable. Here we assume that the filenames follow the format SAMPLENAME_XXX.fastq.
sample.names <- sapply(strsplit(fnFs, "_"), `[`, 1)
sample.names
## [1] "S10" "S11" "S12" "S13" "S14" "S15" "S16" "S17" "S18" "S1" "S20"
## [12] "S21" "S22" "S23" "S24" "S25" "S26" "S27" "S28" "S29" "S2" "S30"
## [23] "S31" "S33" "S34" "S35" "S36" "S37" "S39" "S3" "S41" "S42" "S43"
## [34] "S44" "S45" "S46" "S47" "S48" "S49" "S4" "S50" "S51" "S52" "S53"
## [45] "S54" "S55" "S56" "S58" "S59" "S5" "S60" "S61" "S62" "S63" "S64"
## [56] "S65" "S66" "S67" "S68" "S69" "S6" "S72" "S73" "S74" "S75" "S76"
## [67] "S77" "S78" "S79" "S7" "S80" "S81" "S8" "S9"
Now specify the full path to the fnFs and fnRs objects.
fnFs <- file.path(path, fnFs)
fnRs <- file.path(path, fnRs)
For the more meticulous among us, this snippet sorts the sample names in natural order.
library(gtools)
sample.names <- mixedsort(sample.names)
fnFs <- mixedsort(fnFs)
fnRs <- mixedsort(fnRs)
sample.names
## [1] "S1" "S2" "S3" "S4" "S5" "S6" "S7" "S8" "S9" "S10" "S11"
## [12] "S12" "S13" "S14" "S15" "S16" "S17" "S18" "S20" "S21" "S22" "S23"
## [23] "S24" "S25" "S26" "S27" "S28" "S29" "S30" "S31" "S33" "S34" "S35"
## [34] "S36" "S37" "S39" "S41" "S42" "S43" "S44" "S45" "S46" "S47" "S48"
## [45] "S49" "S50" "S51" "S52" "S53" "S54" "S55" "S56" "S58" "S59" "S60"
## [56] "S61" "S62" "S63" "S64" "S65" "S66" "S67" "S68" "S69" "S72" "S73"
## [67] "S74" "S75" "S76" "S77" "S78" "S79" "S80" "S81"
This first step visualizes sequence quality using the Q score associated with each nucleotide.
plotQualityProfile(fnFs[1]) # 1st Forward sample
plotQualityProfile(fnRs[1]) # 1st Reverse sample
In these figures, the median is shown in green and the quartiles as dotted orange lines. Here we plotted only the first forward and reverse fastq files (fnFs[1] and fnRs[1]), but it is possible to plot several at once (fnFs[x:y]) or to aggregate them, as follows.
plotQualityProfile(fnFs, aggregate = TRUE)
plotQualityProfile(fnRs, aggregate = TRUE)
Analysing these figures helps us choose the filtering and trimming parameters for the next step. The Q score tells us about sequencing accuracy (see the table below).
| Q score | Precision |
|---|---|
| 10 | 90 % |
| 20 | 99 % |
| 30 | 99.9 % |
| 40 | 99.99 % |
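The accuracy column follows directly from the Phred definition of the Q score, Q = -10·log10(p), where p is the probability of a miscalled base. A quick base-R check (no DADA2 required; the q values are simply the table rows):

```r
# Per-base accuracy from the Q score: accuracy = 1 - 10^(-Q/10)
q <- c(10, 20, 30, 40)
accuracy <- (1 - 10^(-q / 10)) * 100
setNames(round(accuracy, 2), paste0("Q", q))
# Q10: 90, Q20: 99, Q30: 99.9, Q40: 99.99
```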
First, we will create a directory (filtered_pairedend) and objects (filtFs and filtRs) to store the filtered sequences.
filt_path <- file.path(path, "filtered_pairedend")
filtFs <- file.path(filt_path, paste0(sample.names, "_F_filt.fastq.gz"))
filtRs <- file.path(filt_path, paste0(sample.names, "_R_filt.fastq.gz"))
Let's proceed with the filterAndTrim function; its output will be stored in the out object.
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
truncQ=6,
truncLen = c(280,280),
trimLeft=c(18,20),
maxEE=c(2,2))
## Creating output directory: data/ITS_sub//filtered_pairedend
First, the function needs the unfiltered sequences (fnFs and fnRs) as well as the file names for the filtered sequences (filtFs and filtRs). Several parameters can then be adjusted as we wish; truncQ, truncLen, trimLeft and maxEE are the ones used above. Other settings can also be changed; they are described on the function's help page: ?filterAndTrim
pourc <- cbind((out[,2]/out[,1])*100) # Percentage of reads kept after filtering
pourc_disc <- cbind(out, pourc) # Combine out and pourc
pourc_disc
## reads.in reads.out
## S1_R1.fastq 1000 595 59.5
## S2_R1.fastq 1000 482 48.2
## S3_R1.fastq 1000 556 55.6
## S4_R1.fastq 1000 425 42.5
## S5_R1.fastq 1000 387 38.7
## S6_R1.fastq 1000 619 61.9
## S7_R1.fastq 1000 521 52.1
## S8_R1.fastq 1000 423 42.3
## S9_R1.fastq 1000 492 49.2
## S10_R1.fastq 1000 366 36.6
## S11_R1.fastq 1000 432 43.2
## S12_R1.fastq 1000 546 54.6
## S13_R1.fastq 1000 417 41.7
## S14_R1.fastq 1000 229 22.9
## S15_R1.fastq 1000 512 51.2
## S16_R1.fastq 1000 481 48.1
## S17_R1.fastq 1000 634 63.4
## S18_R1.fastq 1000 607 60.7
## S20_R1.fastq 1000 614 61.4
## S21_R1.fastq 1000 598 59.8
## S22_R1.fastq 1000 539 53.9
## S23_R1.fastq 1000 561 56.1
## S24_R1.fastq 1000 599 59.9
## S25_R1.fastq 1000 598 59.8
## S26_R1.fastq 1000 567 56.7
## S27_R1.fastq 1000 650 65.0
## S28_R1.fastq 1000 607 60.7
## S29_R1.fastq 1000 526 52.6
## S30_R1.fastq 1000 551 55.1
## S31_R1.fastq 1000 571 57.1
## S33_R1.fastq 1000 482 48.2
## S34_R1.fastq 1000 644 64.4
## S35_R1.fastq 1000 558 55.8
## S36_R1.fastq 1000 453 45.3
## S37_R1.fastq 1000 446 44.6
## S39_R1.fastq 1000 521 52.1
## S41_R1.fastq 1000 517 51.7
## S42_R1.fastq 1000 447 44.7
## S43_R1.fastq 1000 500 50.0
## S44_R1.fastq 1000 369 36.9
## S45_R1.fastq 1000 359 35.9
## S46_R1.fastq 1000 540 54.0
## S47_R1.fastq 1000 501 50.1
## S48_R1.fastq 1000 391 39.1
## S49_R1.fastq 1000 516 51.6
## S50_R1.fastq 1000 422 42.2
## S51_R1.fastq 1000 567 56.7
## S52_R1.fastq 1000 521 52.1
## S53_R1.fastq 1000 599 59.9
## S54_R1.fastq 1000 472 47.2
## S55_R1.fastq 1000 571 57.1
## S56_R1.fastq 1000 416 41.6
## S58_R1.fastq 1000 525 52.5
## S59_R1.fastq 1000 559 55.9
## S60_R1.fastq 1000 593 59.3
## S61_R1.fastq 1000 537 53.7
## S62_R1.fastq 1000 470 47.0
## S63_R1.fastq 1000 644 64.4
## S64_R1.fastq 1000 595 59.5
## S65_R1.fastq 1000 561 56.1
## S66_R1.fastq 1000 551 55.1
## S67_R1.fastq 1000 544 54.4
## S68_R1.fastq 1000 532 53.2
## S69_R1.fastq 1000 372 37.2
## S72_R1.fastq 1000 558 55.8
## S73_R1.fastq 1000 581 58.1
## S74_R1.fastq 1000 431 43.1
## S75_R1.fastq 1000 356 35.6
## S76_R1.fastq 1000 403 40.3
## S77_R1.fastq 1000 387 38.7
## S78_R1.fastq 1000 400 40.0
## S79_R1.fastq 1000 560 56.0
## S80_R1.fastq 1000 457 45.7
## S81_R1.fastq 1000 594 59.4
(mean(out[,2])/mean(out[,1]))*100 # Mean percentage
## [1] 50.98243
CHALLENGE
Plot the quality profile of the first sample after filtering and compare it with its unfiltered quality profile. Use the plotQualityProfile function.
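One possible solution, sketched under the assumption that filterAndTrim has already run, so the filtered files listed in filtFs exist on disk:

```r
# Unfiltered vs. filtered quality profile of the first forward sample
plotQualityProfile(fnFs[1])   # before filtering
plotQualityProfile(filtFs[1]) # after filtering
```

The filtered profile should show the truncation at 280 bp and the removal of the low-quality tail.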
This step estimates the error rates introduced by sequencing. Its purpose is to distinguish genuine sequence variants from erroneous sequences. The error model is computed by alternating between estimation of the error rates and inference of the sample composition until they converge on a jointly consistent solution.
errF <- learnErrors(filtFs)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 595 reads in 364 unique sequences.
## Sample 2 - 482 reads in 294 unique sequences.
## Sample 3 - 556 reads in 363 unique sequences.
## Sample 4 - 425 reads in 278 unique sequences.
## Sample 5 - 387 reads in 280 unique sequences.
## Sample 6 - 619 reads in 348 unique sequences.
## Sample 7 - 521 reads in 328 unique sequences.
## Sample 8 - 423 reads in 221 unique sequences.
## Sample 9 - 492 reads in 317 unique sequences.
## Sample 10 - 366 reads in 238 unique sequences.
## Sample 11 - 432 reads in 298 unique sequences.
## Sample 12 - 546 reads in 344 unique sequences.
## Sample 13 - 417 reads in 324 unique sequences.
## Sample 14 - 229 reads in 162 unique sequences.
## Sample 15 - 512 reads in 319 unique sequences.
## Sample 16 - 481 reads in 341 unique sequences.
## Sample 17 - 634 reads in 378 unique sequences.
## Sample 18 - 607 reads in 344 unique sequences.
## Sample 19 - 614 reads in 365 unique sequences.
## Sample 20 - 598 reads in 387 unique sequences.
## Sample 21 - 539 reads in 345 unique sequences.
## Sample 22 - 561 reads in 344 unique sequences.
## Sample 23 - 599 reads in 384 unique sequences.
## Sample 24 - 598 reads in 394 unique sequences.
## Sample 25 - 567 reads in 316 unique sequences.
## Sample 26 - 650 reads in 294 unique sequences.
## Sample 27 - 607 reads in 339 unique sequences.
## Sample 28 - 526 reads in 365 unique sequences.
## Sample 29 - 551 reads in 337 unique sequences.
## Sample 30 - 571 reads in 375 unique sequences.
## Sample 31 - 482 reads in 287 unique sequences.
## Sample 32 - 644 reads in 388 unique sequences.
## Sample 33 - 558 reads in 303 unique sequences.
## Sample 34 - 453 reads in 277 unique sequences.
## Sample 35 - 446 reads in 318 unique sequences.
## Sample 36 - 521 reads in 364 unique sequences.
## Sample 37 - 517 reads in 322 unique sequences.
## Sample 38 - 447 reads in 327 unique sequences.
## Sample 39 - 500 reads in 326 unique sequences.
## Sample 40 - 369 reads in 275 unique sequences.
## Sample 41 - 359 reads in 251 unique sequences.
## Sample 42 - 540 reads in 372 unique sequences.
## Sample 43 - 501 reads in 289 unique sequences.
## Sample 44 - 391 reads in 277 unique sequences.
## Sample 45 - 516 reads in 323 unique sequences.
## Sample 46 - 422 reads in 282 unique sequences.
## Sample 47 - 567 reads in 379 unique sequences.
## Sample 48 - 521 reads in 351 unique sequences.
## Sample 49 - 599 reads in 371 unique sequences.
## Sample 50 - 472 reads in 315 unique sequences.
## Sample 51 - 571 reads in 386 unique sequences.
## Sample 52 - 416 reads in 272 unique sequences.
## Sample 53 - 525 reads in 323 unique sequences.
## Sample 54 - 559 reads in 272 unique sequences.
## Sample 55 - 593 reads in 370 unique sequences.
## Sample 56 - 537 reads in 357 unique sequences.
## Sample 57 - 470 reads in 282 unique sequences.
## Sample 58 - 644 reads in 391 unique sequences.
## Sample 59 - 595 reads in 370 unique sequences.
## Sample 60 - 561 reads in 350 unique sequences.
## Sample 61 - 551 reads in 373 unique sequences.
## Sample 62 - 544 reads in 343 unique sequences.
## Sample 63 - 532 reads in 338 unique sequences.
## Sample 64 - 372 reads in 232 unique sequences.
## Sample 65 - 558 reads in 369 unique sequences.
## Sample 66 - 581 reads in 347 unique sequences.
## Sample 67 - 431 reads in 327 unique sequences.
## Sample 68 - 356 reads in 230 unique sequences.
## Sample 69 - 403 reads in 277 unique sequences.
## Sample 70 - 387 reads in 281 unique sequences.
## Sample 71 - 400 reads in 275 unique sequences.
## Sample 72 - 560 reads in 373 unique sequences.
## Sample 73 - 457 reads in 331 unique sequences.
## Sample 74 - 594 reads in 362 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## Convergence after 4 rounds.
## Total reads used: 37727
errR <- learnErrors(filtRs)
## Initializing error rates to maximum possible estimate.
## Sample 1 - 595 reads in 340 unique sequences.
## Sample 2 - 482 reads in 280 unique sequences.
## Sample 3 - 556 reads in 296 unique sequences.
## Sample 4 - 425 reads in 312 unique sequences.
## Sample 5 - 387 reads in 252 unique sequences.
## Sample 6 - 619 reads in 320 unique sequences.
## Sample 7 - 521 reads in 299 unique sequences.
## Sample 8 - 423 reads in 247 unique sequences.
## Sample 9 - 492 reads in 299 unique sequences.
## Sample 10 - 366 reads in 273 unique sequences.
## Sample 11 - 432 reads in 284 unique sequences.
## Sample 12 - 546 reads in 325 unique sequences.
## Sample 13 - 417 reads in 329 unique sequences.
## Sample 14 - 229 reads in 170 unique sequences.
## Sample 15 - 512 reads in 318 unique sequences.
## Sample 16 - 481 reads in 353 unique sequences.
## Sample 17 - 634 reads in 364 unique sequences.
## Sample 18 - 607 reads in 286 unique sequences.
## Sample 19 - 614 reads in 332 unique sequences.
## Sample 20 - 598 reads in 338 unique sequences.
## Sample 21 - 539 reads in 317 unique sequences.
## Sample 22 - 561 reads in 310 unique sequences.
## Sample 23 - 599 reads in 327 unique sequences.
## Sample 24 - 598 reads in 373 unique sequences.
## Sample 25 - 567 reads in 288 unique sequences.
## Sample 26 - 650 reads in 243 unique sequences.
## Sample 27 - 607 reads in 310 unique sequences.
## Sample 28 - 526 reads in 336 unique sequences.
## Sample 29 - 551 reads in 308 unique sequences.
## Sample 30 - 571 reads in 351 unique sequences.
## Sample 31 - 482 reads in 276 unique sequences.
## Sample 32 - 644 reads in 343 unique sequences.
## Sample 33 - 558 reads in 283 unique sequences.
## Sample 34 - 453 reads in 278 unique sequences.
## Sample 35 - 446 reads in 313 unique sequences.
## Sample 36 - 521 reads in 313 unique sequences.
## Sample 37 - 517 reads in 309 unique sequences.
## Sample 38 - 447 reads in 311 unique sequences.
## Sample 39 - 500 reads in 332 unique sequences.
## Sample 40 - 369 reads in 272 unique sequences.
## Sample 41 - 359 reads in 251 unique sequences.
## Sample 42 - 540 reads in 346 unique sequences.
## Sample 43 - 501 reads in 292 unique sequences.
## Sample 44 - 391 reads in 278 unique sequences.
## Sample 45 - 516 reads in 331 unique sequences.
## Sample 46 - 422 reads in 301 unique sequences.
## Sample 47 - 567 reads in 340 unique sequences.
## Sample 48 - 521 reads in 348 unique sequences.
## Sample 49 - 599 reads in 354 unique sequences.
## Sample 50 - 472 reads in 330 unique sequences.
## Sample 51 - 571 reads in 335 unique sequences.
## Sample 52 - 416 reads in 259 unique sequences.
## Sample 53 - 525 reads in 306 unique sequences.
## Sample 54 - 559 reads in 272 unique sequences.
## Sample 55 - 593 reads in 321 unique sequences.
## Sample 56 - 537 reads in 329 unique sequences.
## Sample 57 - 470 reads in 276 unique sequences.
## Sample 58 - 644 reads in 343 unique sequences.
## Sample 59 - 595 reads in 328 unique sequences.
## Sample 60 - 561 reads in 342 unique sequences.
## Sample 61 - 551 reads in 352 unique sequences.
## Sample 62 - 544 reads in 324 unique sequences.
## Sample 63 - 532 reads in 295 unique sequences.
## Sample 64 - 372 reads in 221 unique sequences.
## Sample 65 - 558 reads in 341 unique sequences.
## Sample 66 - 581 reads in 330 unique sequences.
## Sample 67 - 431 reads in 331 unique sequences.
## Sample 68 - 356 reads in 246 unique sequences.
## Sample 69 - 403 reads in 297 unique sequences.
## Sample 70 - 387 reads in 273 unique sequences.
## Sample 71 - 400 reads in 295 unique sequences.
## Sample 72 - 560 reads in 353 unique sequences.
## Sample 73 - 457 reads in 329 unique sequences.
## Sample 74 - 594 reads in 347 unique sequences.
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## Convergence after 4 rounds.
## Total reads used: 37727
The minimum number of reads to use for learning the error rates can be specified with the nreads parameter.
plotErrors(errF, nominalQ=TRUE)
plotErrors(errR, nominalQ=TRUE)
The error rates for each possible transition (e.g. A->C, A->G, ...) are shown. Each point is an observed error rate for a given consensus quality score. The black line shows the estimated error rates after convergence, and the red line shows the error rates expected under the nominal definition of the Q score (for Illumina technology).
In this step, all identical sequencing reads are combined into unique sequences, each with a corresponding abundance. This reduces subsequent computation time by eliminating redundant comparisons. The dereplicated sequences take the names of the samples they come from.
derepFs <- derepFastq(filtFs)
names(derepFs) <- sample.names
derepRs <- derepFastq(filtRs)
names(derepRs) <- sample.names
The advantage of DADA2 lies in retaining a summary of the quality information associated with each unique sequence. The consensus quality profile of a unique sequence is the average of the positional qualities of the dereplicated reads. These quality profiles inform the error model used in the subsequent sample inference step, significantly increasing DADA2's accuracy.
dadaFs <- dada(derepFs,
err = errF,
pool=TRUE,
multithread=TRUE)
## 74 samples were pooled: 37727 reads in 20364 unique sequences.
dadaRs <- dada(derepRs,
err=errR,
pool=TRUE,
multithread=TRUE)
## 74 samples were pooled: 37727 reads in 18908 unique sequences.
dadaFs[[1]] # 1st forward sample
## dada-class: object describing DADA2 denoising results
## 76 sample sequences were inferred from 364 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE
dadaRs[[1]] # 1st reverse sample
## dada-class: object describing DADA2 denoising results
## 77 sample sequences were inferred from 340 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, BAND_SIZE = 16, USE_QUALS = TRUE
#save(dadaRs, file="data/dadaRs.rdata")
#save(dadaFs, file="data/dadaFs.rdata")
The whole point of paired-end sequencing is to merge the two strands, which increases our confidence in their reliability. Merging also yields longer amplicons.
The mergePairs function needs the objects computed in the two preceding steps (derep and dada). The parameters we can freely modify here are minOverlap and maxMismatch. Other settings can also be changed; they are described on the function's help page: ?mergePairs. For example, with returnRejects = TRUE, pairs that were rejected because of mismatches in the overlap region are kept in the output.
mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs,
minOverlap = 12,
maxMismatch = 0)
head(mergers[[1]])
## sequence
## 1 ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACCTTGCGCCCCTTGGTATTCCGAGGGGCACACCCGTTTGAGTGTCGTGAATACTCTCAACCTTCTTGGTTTCTTTGACCACGAAGGCTTGGACTTTGGAGGTTTTTCTTGCTGGCCTCTTTAGAAGCCAGCTCCTCCTAAATGAATGGGTGGGGTCCGCTTTGCTGATCCTCGACGTGATAAGCATCTCTTCTACGTCTCAGTGTCAGCTCGGAACCCCGCTTTCCAACCGTCTTTGGACAAAGACAATGTTCGAGTTGCGACTCGACCTTACAAACCTTGACCTCAAATCGGGTGAGACTACCCGCTGAACTTAA
## 2 ATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATCCCGAGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTGGCTTGGTGTTGGGCGACGTCCCCTTTTGGGGACGCGTCTCGAAACGCTCGGCGGCGTGGCACCGGCTTTAAGCGTAGCAGAATCTTTCGCTTTGAAAGTCGGGGCCCCGTCTGCCGGAAGACCTACTCGCAAGGTTGACCTCGGATCAGGCAGGGATACCCGCTGAACTTAA
## 3 ATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCGAGGGGCATGCCTGTTCGAGCGTCATTATAACCACTCAAGCCCCGGCTTGGTCTTGGGGTTCGCGGTCCGCGGCCCTTAAACTCAGTGGCGGTGCCGTCTGGCTCTAAGCGCAGTAATTCTCTCGCTATAGTGTCTAGGTGGTTGCTTGCCATAATCCCCCAATTTTTTACGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA
## 4 ATGCGATACGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCACTCCTTGGTATTCCGAGGAGTATGCCTGTTTCAGTATCATGAGCACTCTCACACCTAACCTTTGGGTTTATGGCGTGGAATTGGAATGCGCCGACTGTCATGGTTGGCCCTTCTAAAATGTAGTTCTTGGCTGTCACCTAATACAGCAGTTTGGCCTAATAGTTTTGGCATTCATTGTCAAATCTTTGGCTAACATTTGCTCCAGGAGTCAGTCTTGATAATACAGAAAACTCATTCAAATTTTGATCTGAAATCAGGTAGGGCTACCCGCTGAACTTAA
## 5 ATGCGATACGTAATGTGAATTGCAGAATTCCGTGAATCATTGAATCTTTGAACGCATCTTGCGCCTCTTGGTATTCCGAGGGGCATGCCTGTTTGAGTGTCATTAGAACTATCAAAAAAATAGATGATTTCAATCGTTAATTTTTTTGGAATTGGAGGTGGTGCTGGTCTTTTTCCATTAATGGCCCAAGCTCCTCCGAAATGCATTAGCGAATGCAGTGCACTTTTTCTCCTTGCTTTTTCTGGGCATTGATAGTTTACTCTCATGCCCTAAGCTGGTAGGGAGGAAGTCACAGAATGCTTCCCGCTCCTGAATGTAATACAAAACTTGACGATCAAACCCCTCAAATCAGGCAGGACTACCCGCTGAACTTAA
## 6 ATGCGATAAGTAATGCGAATTGCAGAATTCAGTGAGTCATCGAATCTTTGAACGCATATTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTCGAGCGTCATGATCAACCATCAAGCCTGGCTTGTCGTTGGACCCTGTTGTCTCTGGGCGACAGGTCCGAAAGATAATGACGGTGTCATGGCAACCCCGAATGCAACGAGCTTTTTTATAGGCACGCATTTAGTGGTTGGCAAGGCCCCCTCGTGCGTTATTATTTTCTTACGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA
## abundance forward reverse nmatch nmismatch nindel prefer accept
## 1 172 15 12 150 0 0 2 TRUE
## 2 43 8 9 230 0 0 1 TRUE
## 3 37 22 18 234 0 0 2 TRUE
## 4 34 3 2 179 0 0 2 TRUE
## 5 32 4 3 147 0 0 1 TRUE
## 6 21 12 10 218 0 0 1 TRUE
max(mergers[[1]]$nmatch) # Largest overlap
## [1] 257
min(mergers[[1]]$nmatch) # Smallest overlap
## [1] 113
ASV stands for amplicon sequence variant; it is the name given to DADA2's equivalent of OTUs. However, ASVs are quite different from OTUs because they are never clustered (apart from identical sequences).
seqtab <- makeSequenceTable(mergers)
## The sequences being tabled vary in length.
seqtab[,1]
## S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12 S13 S14 S15 S16 S17 S18
## 3 5 2 2 2 1 1 0 3 0 2 0 2 3 1 48 206 129
## S20 S21 S22 S23 S24 S25 S26 S27 S28 S29 S30 S31 S33 S34 S35 S36 S37 S39
## 1 1 1 1 3 2 0 532 1 7 81 11 2 0 1 1 2 9
## S41 S42 S43 S44 S45 S46 S47 S48 S49 S50 S51 S52 S53 S54 S55 S56 S58 S59
## 4 2 1 4 6 3 0 5 0 2 6 4 2 2 3 0 0 0
## S60 S61 S62 S63 S64 S65 S66 S67 S68 S69 S72 S73 S74 S75 S76 S77 S78 S79
## 2 4 9 0 1 5 1 0 3 4 2 4 5 0 3 11 2 8
## S80 S81
## 9 1
dim(seqtab)
## [1] 74 551
We obtain 551 ASVs from the 74,000 raw reads we started with.
hist(nchar(getSequences(seqtab)),xlab="Size", ylab="Frequency", main = "ASVs length", xlim=c(250,450), ylim=c(0,250))
This step aims to remove all non-biological sequences. With method = "pooled", all samples in the ASV table are pooled for bimera identification. Other methods can be used, such as the consensus method, in which each sample is checked individually for bimeras.
seqtab.nochim <- removeBimeraDenovo(seqtab,
method = "pooled",
multithread = TRUE,
verbose = TRUE)
## Identified 5 bimeras out of 551 input sequences.
#save(seqtab.nochim, file="data/seqtab.nochim.rdata")
round((sum(seqtab.nochim)/sum(seqtab)*100),2) # Percentage of the total sequence reads
## [1] 99.41
hist(nchar(getSequences(seqtab.nochim)),xlab="Size", ylab="Frequency", main = "ASVs length", xlim=c(250,450), ylim=c(0,250)) # Length of the non-chimeric sequences
We can also transform the ASV counts into presence/absence with the code below, which makes it possible to count the number of ASVs per sample.
seqtab.nochim.bin <- ifelse(seqtab.nochim>0,1,0)
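On a toy matrix the same transformation looks like this (base R only; the counts are hypothetical, not the real seqtab.nochim):

```r
# Toy abundance matrix: 2 samples x 3 ASVs
counts <- matrix(c(5, 0, 2,
                   0, 0, 7),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("S1", "S2"), paste0("ASV", 1:3)))
counts.bin <- ifelse(counts > 0, 1, 0) # presence/absence
rowSums(counts.bin)                    # ASVs per sample: S1 = 2, S2 = 1
```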
The following table shows how many reads were discarded at each step. A sharp drop in the number of reads may indicate a problem. For example, if few reads survive bimera removal, primer fragments containing ambiguous nucleotides may still be present.
getN <- function(x) sum(getUniques(x))
track <- cbind(out, # input & filtered
round(((out[,2]/out[,1])*100),2), # % (filtered / input)
sapply(mergers, getN), # Merged
round(((sapply(mergers, getN)/out[,1])*100),2), # % (Merged / Input)
rowSums(seqtab.nochim),# Non-chimeric
round(((rowSums(seqtab.nochim)/out[,1])*100),2),# % (Non-chimeric / Input)
rowSums(seqtab.nochim.bin)) # Number of ASVs per sample
colnames(track) <- c("Input",
"Filter",
"%Filt/In",
"Merge",
"%Mer/In",
"Non-chim",
"%Non-chim/In",
"#ASV/sample") # Column names
rownames(track) <- sample.names # Row names
track<-as.data.frame(track)
head(track)
## Input Filter %Filt/In Merge %Mer/In Non-chim %Non-chim/In #ASV/sample
## S1 1000 595 59.5 576 57.6 576 57.6 68
## S2 1000 482 48.2 431 43.1 427 42.7 37
## S3 1000 556 55.6 481 48.1 481 48.1 30
## S4 1000 425 42.5 373 37.3 372 37.2 52
## S5 1000 387 38.7 328 32.8 327 32.7 59
## S6 1000 619 61.9 597 59.7 597 59.7 24
We finally reach the end of the pipeline with the important taxonomic assignment step. Using an implementation of the naive Bayesian classifier method, the assignTaxonomy function takes as input the set of sequences to be classified and a set of reference sequences with known taxonomy, and returns taxonomic assignments with at least the minimum bootstrap confidence specified by the minBoot parameter. The reference database (UNITE) is available at https://unite.ut.ee/repository.php
# taxotab <- assignTaxonomy(seqtab.nochim,
#                           refFasta = "reference_database/sh_general_release_dynamic_01.12.2017.fasta",
#                           minBoot = 50, # Default 50: minimum bootstrap confidence for assigning a taxonomic level
#                           outputBootstraps = TRUE)
#
# View(taxotab)
# length(unique(unname(taxotab$tax[,7]))) # Number of unique species
# save(taxotab, file="data/taxotab.01.05.rdata")
We obtain XX different species.
CHALLENGE
Can you find the number of different families?
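A possible approach, kept commented out like the assignTaxonomy block above since it depends on the taxotab object; it assumes the UNITE taxonomy matrix has one column per rank, with Family typically in column 5 (check colnames(taxotab$tax) first):

```r
# fam <- taxotab$tax[, 5]          # Family column (verify with colnames(taxotab$tax))
# length(unique(fam[!is.na(fam)])) # Number of different families, NAs excluded
```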
NB1: dada2 does not throw away singleton reads. However, it does not infer biological sequence variants that are supported by only a single read; singletons are assumed to be too difficult to differentiate from errors. Hence there are no singletons in the output table of amplicon sequence variants.
NB2: DADA2 consistently measures diversity across different filtering parameters and error rates. OTU methods do not. https://twitter.com/bejcal/status/771010634074820608
NB3: The ASVs with no species assignment do not match the same species in over 50% of the bootstrap replicate kmer-based assignments (see Wang et al. 2007 for more info on the naive Bayesian classifier method).